library(readr)
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
library(data.world)
##
## Attaching package: 'data.world'
## The following object is masked from 'package:dplyr':
##
## query
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.3.2
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:magrittr':
##
## set_names
## The following objects are masked from 'package:dplyr':
##
## contains, order_by
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.2
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(openxlsx)
library(choroplethr)
## Warning: package 'choroplethr' was built under R version 3.3.2
## Loading required package: acs
## Loading required package: stringr
## Warning: package 'stringr' was built under R version 3.3.2
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
##
## compact
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Loading required package: XML
##
## Attaching package: 'acs'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:base':
##
## apply
library(choroplethrMaps)
## Warning: package 'choroplethrMaps' was built under R version 3.3.2
library(plotly)
## Warning: package 'plotly' was built under R version 3.3.2
##
## Attaching package: 'plotly'
## The following objects are masked from 'package:plyr':
##
## arrange, mutate, rename, summarise
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Economic inequality has been called “the defining challenge of our time”, and indeed it’s hard to understate the scale of the phenomenon: according to Oxfam’s latest estimate, the eight richest men in the world own as much wealth as the poorest 3.5 billion people. In november 2016, Donald J. Trump was elected President of the United States. It’s a major shift in US politics.
Are these two historic phenomena somehow related? It’s natural to assume that a growing frustration with inequality could have fueled the Trump vote. A study by Bruegel concluded in that direction.
County-level data provide fine-grained information to test this hypothesis, and that’s what I’ll do here—after recalling some basic facts about the 2016 US election ifself.
It’s well known that Trump lost the popular vote. This, as well as the media focus on “swing states”, suggests that the election was a close call. At the county level, though, it was anything but.
Let’s begin by visualizing the election results county by county, using the data at data.world.
## Get data from data.world
conn <- data.world()
presResults2016 <- data.world::query(conn,
dataset = 'data4democracy/election-transparency',
query = "SELECT * FROM PresidentialElectionResults2016")
GOP_results <- presResults2016 %>%
select(County, rPct) %>%
rename(region = County)
GOP_results %>%
rename(value = rPct) %>%
county_choropleth(title = 'Trump vote across US counties')
## Warning in super$initialize(map.df, user.df): Your data.frame contains the
## following regions which are not mappable: 46102, 2158
## Warning in self$bind(): The following regions were missing and are being
## set to NA: 46113, 15005, 51515, 2270
Summary statistics for Trump votes are telling—the median Trump vote is 66%, and more than 80% of counties gave Trump a supermajority:
summary(GOP_results$rPct)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0407 0.5424 0.6623 0.6307 0.7484 0.9458
absolute_maj <- filter(GOP_results, rPct > .5)
100* length(absolute_maj$rPct) / length(GOP_results$rPct)
## [1] 80.19739
plot_ly(x = GOP_results$rPct, type = "histogram")
Thus, when viewed from the perspective of county-level voting patterns, Trump’s victory was a landslide. He won almost everywhere, generally with a very large margin.
The Economic Policy Institute published a report with county-level data on economic inequality for 2010 - 2013: Sommeiller et al (2016). I’ll start from there.
#income_county_raw <- read.xlsx("~/ML/election-transparency/data-raw/sommeiller_et_al_2016/top-incomes-since-1917_vs2013-06-15-2016.xlsx", sheet = #"Data_CO")
income_county_raw <- read.xlsx("http://go.epi.org/unequalstates2016data", sheet = "Data_CO")
income_county <- income_county_raw[c("cntyname","year", "cofips","SP99_100")] %>%
rowwise() %>% mutate(SP99_100 = as.numeric(SP99_100)) %>%
rename(region = cofips) %>%
rowwise() %>% mutate(region = str_sub(region, 1, -4)) %>%
rowwise() %>% mutate(region = as.numeric(region))
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
The variablel SP99_100 is the share of total income captured by the top 1%: a simple measure of economic inequality. There’s quite a bit of variation across US counties, with a coefficient of variation of 37%.
top_1_percent_2013 <- income_county %>%
filter(year == 2013) %>%
select(- year, - cntyname) %>%
drop_na
sd(top_1_percent_2013$SP99_100)/mean(top_1_percent_2013$SP99_100)
## [1] 0.3740631
plot_ly(top_1_percent_2013, x = ~SP99_100, type = "histogram") %>%
layout(title = "Distribution income share of top 1% across US counties (2013)")
Let’s make a map of income inequality for (say) 2013.
top_1_percent_chlor <- top_1_percent_2013 %>%
rename(value = SP99_100) %>%
rowwise() %>% mutate(region = as.numeric(region))
county_choropleth(top_1_percent_chlor, title = "Income share of the top 1% across US counties")
## Warning in self$bind(): The following regions were missing and are being
## set to NA: 31115, 48301, 15005, 51515
If Trump voters aren’t particularly sensitive to the level of economic inequality in their county, could it be that they care about its change? Let’s look at the relative change of the share of income captured by the top 1% between 2010 and 2013:
ineq_GOP <- ineq_GOP %>%
rename(SP99_100_13 = SP99_100)
top_1_percent_2010 <- income_county %>%
filter(year == 2010 )%>%
select(- year, - cntyname) %>%
rename(SP99_100_10 = SP99_100)
ineq_GOP <- join(ineq_GOP, top_1_percent_2010, by = "region") %>% drop_na
ineq_GOP <- ineq_GOP %>% mutate(SP99_100_change = (SP99_100_13 - SP99_100_10)/SP99_100_10)
plot_ly(ineq_GOP, x = ~SP99_100_change, type = "histogram") %>%
layout(title = "Distribution of variation of income share of top 1% across US counties (2010 - 2013)")
Big changes in just three years! In some counties, the share of the top 1% has increased by up to 50%; in others, it has decreased by similar amounts.
Is the Trump vote correlated with SP99_100_change? No:
plot_ly(ineq_GOP, x=~rPct, y = ~SP99_100_change)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
cor.test(x=ineq_GOP$rPct, y = ineq_GOP$SP99_100_change)
##
## Pearson's product-moment correlation
##
## data: ineq_GOP$rPct and ineq_GOP$SP99_100_change
## t = 2.6553, df = 3131, p-value = 0.007965
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01240202 0.08228198
## sample estimates:
## cor
## 0.0474
The notebook kht_modeling_results.Rmd presents an analysis of the Trump vote in terms of county characteristics, with a random forest regressor explaining 82% of the variance. I’ll reuse the pro-processing done there:
# pre-processing of county characteristics from kht_modeling_results.Rmd
## Get data from data.world
conn <- data.world()
countyChar <- data.world::query(conn,
dataset = 'data4democracy/election-transparency',
query = "SELECT * FROM CountyCharacteristics")
voterReg2016 <- data.world::query(conn,
dataset = 'data4democracy/election-transparency',
query = "SELECT * FROM PartyRegistration WHERE Year = 2016 AND Month = 11")
presResults2016 <- data.world::query(conn,
dataset = 'data4democracy/election-transparency',
query = "SELECT * FROM PresidentialElectionResults2016")
## Prep tables and join them
voterReg2016 <- voterReg2016 %>%
select(-one_of("CountyName", "StateName", "StateAbbr", "Year", "Month", "YearMonth"))
names(voterReg2016) <- ifelse(names(voterReg2016) %in% c('State', 'County'), names(voterReg2016),
paste0(names(voterReg2016), 'Reg'))
data2016 <- reduce(list(countyChar, voterReg2016, presResults2016),
left_join,
by = c('County', 'State'))
## @jenniferthompson's feature engineering
prop_total <- function(x){ x / data2016$TotalPopulation }
data2016 <- data2016 %>%
mutate(propMale = prop_total(Male),
propKids = prop_total(Age0_4 + Age5_9 + Age10_14 + Age15_19),
propAdultsNoTeens = 1 - propKids,
totalAdultsWithTeens = Age15_19 + Age20_24 + Age25_34 + Age35_44 + Age45_54 + Age55_59 +
Age60_64 + Age65_74 + Age75_84 + Age85,
propAdultsWithTeens = prop_total(totalAdultsWithTeens),
totalAdultsNoTeens = Age20_24 + Age25_34 + Age35_44 + Age45_54 + Age55_59 + Age60_64 +
Age65_74 + Age75_84 + Age85,
propElders = prop_total(Age65_74 + Age75_84 + Age85),
propNMarried = NeverMarried / totalAdultsWithTeens,
propHispanic = prop_total(Hispanic),
propWhite = prop_total(White),
propBlack = prop_total(Black),
majWhite = propWhite > 0.5,
majBlack = propBlack > 0.5,
propNoHS = (EdK8 + Ed9_12) / totalAdultsNoTeens,
propHS = EdHS / totalAdultsNoTeens,
propMoreHS = (EdCollNoDegree + EdAssocDegree + EdBachelorDegree + EdGraduateDegree) /
totalAdultsNoTeens,
propMfg2015 = MfgEmp2015 / LaborForce,
propUnemp = Unemployment / LaborForce,
propLaborForce = prop_total(LaborForce),
propStein = stein / totalvotes,
propJohnson = johnson / totalvotes,
propVoters = totalvotes / totalAdultsNoTeens)
data2016 <- data2016 %>%
mutate(propUninsured = prop_total(Uninsured),
propForeignBorn = prop_total(ForeignBorn),
propNonCitizen = prop_total(NonCitizen),
propDisability = prop_total(Disability),
propTotalSSI = prop_total(TotalSSI),
propAgedSSI = prop_total(AgedSSI),
propBlindDisabledSSI = prop_total(BlindDisabledSSI),
propOASDI = prop_total(OASDI),
propMfg1970 = MfgEmp1970 / TotalEmp1970,
propMfg1980 = MfgEmp1980 / TotalEmp1980,
propMfg1990 = MfgEmp1990 / TotalEmp1990,
propMfg2001 = MfgEmp2001 / TotalEmp2001)
for_big_rf <- data2016 %>%
select(rDRPct, County, # Objective function and an index for joining later
MedianHouseholdIncome, TotalPopulation, MedianAge, LandAreaSqMiles, # Big dumb basic stats
propMale, propKids, propAdultsNoTeens, propNMarried, propForeignBorn, propNonCitizen, # Demography
propHispanic, propWhite, propBlack, majWhite, majBlack, SimpsonDiversityIndex, # Racial demography
propNoHS, propHS, propMoreHS, # Education
propMfg1970, propMfg1980, propMfg1990, propMfg2001, propMfg2015, propUnemp, propLaborForce, # Labor
propVoters, propJohnson, propStein, # Political (avoiding registration b/c of partyless reg. issue)
MedianHousingCosts, MedianHouseholdIncome, propUninsured, # Financial
propDisability, propTotalSSI, propAgedSSI, propBlindDisabledSSI, propOASDI, # SSI recipients
NCHS_UrbanRural1990, NCHS_UrbanRural2006, NCHS_UrbanRural2013) %>% # Area classifications
# RF can't handle strings
mutate(NCHS_UrbanRural1990 = factor(NCHS_UrbanRural1990),
NCHS_UrbanRural2006 = factor(NCHS_UrbanRural2006),
NCHS_UrbanRural2013 = factor(NCHS_UrbanRural2013),
propStein = ifelse(is.na(propStein), 0, propStein)) %>% # Where Stein wasn't on the ballot, we'll fill in 0
# Can't handle NA either
filter(!is.na(MedianHouseholdIncome), !is.na(propTotalSSI), !is.na(propAgedSSI), !is.na(propBlindDisabledSSI), !is.na(propOASDI),
!is.na(propMfg1970), !is.na(propMfg1980), !is.na(propMfg1990), !is.na(propMfg2001), !is.na(propMfg2015),
!is.na(NCHS_UrbanRural1990), !is.na(NCHS_UrbanRural2013), !is.na(NCHS_UrbanRural2006))
# This results in the loss of 815 of 3,141 counties because of missing data. Not good but maybe we can get some of those back if we can show that the missing variables aren't predictive.
Next I add the economic inequality variables:
ineq <- ineq_GOP %>%
select(region, SP99_100_13, SP99_100_change) %>%
rename(County = region)
for_big_rf <- join(for_big_rf, ineq, by = "County") %>% drop_na
And re-train the random forest:
# Train/test split
trIndex <- createDataPartition(for_big_rf$rDRPct, p = 0.8, list = F)
tr <- select(for_big_rf, -County)[trIndex,]
te <- select(for_big_rf, -County)[-trIndex,]
# Train our RF
big_rf <- randomForest(rDRPct ~ ., tr)
# How's it perform? Looking at MSE here
big_rf
##
## Call:
## randomForest(formula = rDRPct ~ ., data = tr)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 0.004125477
## % Var explained: 82.37
The addition of economic inequality does not increase the explanatory power of the model. When we look at variable importance, we find that SP99_100_13 (the level in 2013) and SP99_100_change (the relative change 2010-2013) are largely irrelevant:
```
varImpPlot(big_rf, n.var = 44)
Although both economic inequality and Trump vote have large variations across the US, there appears to be no clear association between the two variables—at least not at the county level. This conclusion is confirmed by an analysis, at the state level, of the relationship between the number of billionaires and Trump vote (done elsewhere; I may add it to this notebook later on).
This is not say that the growth of economic inequality is not playing a key role in US politics. First, as noted by Bruegel, there is the possibility that economic inequality correlates with the growth in Republican votes over time. I didn’t look at such longitudinal trends here (though all the ingredients are in the notebook). Second, it is easy to imagine that voters react to a diffuse perception that inequalities are getting worse globally, even if they are not sensitive to the level or change of inequality in their own counties. To caricature, my analysis does not exclude that voters in Alabama voted for Trump as a consequence of growing inequalities in the San Francisco Bay Area.
Still, a negative result debunking a preconception can be as important as a positive result revealing a pattern. I hope that this quick-and-dirty notebook can spur an interesting discussion at D4D.